Popular Historic Figures Exploration
by Julian Hernández
Database Information
Pantheon is a project developed by the Macro Connections group at the MIT Media Lab. It is a way to celebrate our accomplishments as a species by documenting our global heritage. You can find out more about this dataset and project on Kaggle or Pantheon’s official site.
This dataset gathers information on the 11,341 biographies that are present in more than 25 language editions of Wikipedia (as of May 2013). It is not restricted to any cultural domain or time period.
The dataset has 17 variables, from which I chose 12 to analyze.
## 'data.frame': 11337 obs. of 14 variables:
## $ full_name : chr "Aristotle" "Plato" "Jesus Christ" "Socrates" ...
## $ birth_year : num -384 -427 -4 -469 -356 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ country : Factor w/ 196 levels "","Afghanistan",..: 63 63 80 63 63 81 37 81 180 63 ...
## $ continent : Factor w/ 8 levels "","Africa","Asia",..: 4 4 3 4 4 4 3 4 4 4 ...
## $ occupation : Factor w/ 88 levels "Actor","American Football Player",..: 60 60 75 60 54 45 60 67 88 60 ...
## $ industry : Factor w/ 27 levels "Activism","Business",..: 24 24 25 24 20 14 24 11 15 24 ...
## $ domain : Factor w/ 8 levels "Arts","Business & Law",..: 4 4 5 4 5 7 4 5 4 4 ...
## $ article_languages : int 152 142 214 137 138 174 192 128 141 114 ...
## $ page_views : int 56355172 46812003 60299092 40307143 48358148 88931135 22363652 43088745 20839405 26168219 ...
## $ average_views : int 370758 329662 281771 294213 350421 511098 116477 336631 147797 229546 ...
## $ historical_popularity_index: num 32 32 31.9 31.7 31.6 ...
## $ antiquity : num 2397 2440 2017 2482 2369 ...
## $ hpi : num 32 32 31.9 31.7 31.6 ...
Variable Description
full_name: Name of the historical figure
birth_year: Birth Year of the historical figure in the Gregorian Calendar
sex: Biological Sex of the historical figure (Male / Female)
country: Country or modern day equivalent where the historical figure was born.
continent: Continent where the historical figure was born.
occupation: Profession the historical figure had.
industry: Industry of the historical figure.
domain: Domain of knowledge where the historical figure excelled.
article_languages: Number of Wikipedia language editions in which the historical figure’s biography is present.
page_views: Total number of pageviews in all languages.
average_views: Total number of pageviews divided by the number of languages the biography is in.
HPI or historical_popularity_index: A more complex measure of historical impact that uses the number of languages, page views, age of the character, and the variation in pageviews per language.
## full_name birth_year sex country
## Length:11337 Min. :-3500 Female:1495 United States :2169
## Class :character 1st Qu.: 1791 Male :9842 United Kingdom:1147
## Mode :character Median : 1919 France : 866
## Mean : 1658 Italy : 809
## 3rd Qu.: 1961 Germany : 748
## Max. : 2005 Unknown : 433
## (Other) :5165
## continent occupation industry
## Europe :6366 Politician :2528 Government :2703
## North America:2439 Actor :1193 Film And Theatre:1374
## Asia :1188 Soccer Player :1064 Team Sports :1230
## Africa : 419 Writer : 953 Music :1054
## Unknown : 406 Religious Figure: 517 Language : 998
## South America: 366 Singer : 437 Natural Sciences: 736
## (Other) : 153 (Other) :4645 (Other) :3242
## domain article_languages page_views
## Institutions :3453 Min. : 26.00 Min. : 1965
## Arts :2866 1st Qu.: 29.00 1st Qu.: 628928
## Sports :1756 Median : 35.00 Median : 1603951
## Science & Technology:1366 Mean : 40.77 Mean : 4202224
## Humanities :1328 3rd Qu.: 46.00 3rd Qu.: 4485693
## Public Figure : 358 Max. :214.00 Max. :145250649
## (Other) : 210
## average_views historical_popularity_index antiquity
## Min. : 49 Min. : 9.879 Min. : 8.0
## 1st Qu.: 18442 1st Qu.:20.432 1st Qu.: 52.0
## Median : 43871 Median :23.027 Median : 94.0
## Mean : 89439 Mean :22.308 Mean : 354.7
## 3rd Qu.: 107243 3rd Qu.:24.589 3rd Qu.: 222.0
## Max. :1515232 Max. :31.994 Max. :5513.0
##
## hpi
## Min. : 9.879
## 1st Qu.:20.432
## Median :23.027
## Mean :22.308
## 3rd Qu.:24.589
## Max. :31.994
##
Is there any guiding principle for historical popularity, such as location, profession or time period? Which professions are more historically memorable or significant? Is there any difference between men and women in the data? How does location influence historical popularity? Is there any difference in the occupations that are remembered per location?
Year of Birth
Year of Birth: Most globally notable figures were born in recent times. Nevertheless there is huge variability: birth years range from 3500 BC to AD 2005.
The data follows an upward trend towards recent times, but it dips in the most recent years (1995 - 2010). This could be because people born in those years are still young and need more time to develop their careers and skills. It could also be due to the specialization of skills needed in contemporary work, which makes teams more necessary, and teams do not appear in the dataset.
Sex:
There is a huge difference in the number of men and women in the dataset. This exploration cannot explain the imbalance, because we are not looking at the complete set of biographies on Wikipedia but at a subset: the ones translated into more than 25 languages.
Location
In the previous graphs we can see the distribution of great people among continents and countries.
In terms of continents, most notable people come from Europe, followed by North America and then Asia. In terms of countries, the United States is followed by several European countries (the UK, France, Italy and Germany).
This reflects Wikipedia’s bias, which the authors of the dataset identify as an issue. Wikipedia is more popular in Western countries, where European history and figures are studied and remembered. Continents with huge populations and history that spans millennia, such as Asia and Africa, are underrepresented.
Industry & Domain
Here we have the profession of the notable people in the dataset at decreasing levels of granularity: occupation, industry and domain.
The most common occupation in the dataset is politician, followed by actor, soccer player and writer. This carries over to the distribution of great people by industry and domain, where most of them are located in government, the arts and sports.
A few things surprised me. One is the number of religious figures that are well known and remembered. I also thought that a career in the entertainment industry would be the leading occupation and that entertainment would be the leading domain, since Wikipedia is volunteer-run and celebrities tend to be far more popular than any politician. Politicians do have a historical advantage, though: politics as an occupation has existed for a long time.
Number of Languages
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 29.00 35.00 40.77 46.00 214.00
The number of languages is skewed to the right (the mean, 40.77, is above the median, 35), with 75% of the biographies being available in 46 languages or fewer. There are several outliers. The largest value in the dataset, 214 languages, belongs to the article on Jesus Christ, followed by Barack Obama (200 languages) and US actor Corbin Bleu (193).
We can clearly see a downward, roughly exponential trend: the number of biographies falls quickly as the number of languages grows.
Total Views & Average Views
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1965 628928 1603951 4202224 4485693 145250649
Page views are highly skewed to the right. Using log10 gives a better representation of the data: it peaks around 1,000,000 views and starts declining again at 10,000,000. The quartiles tell the same story: 50% of the values are below about 1,600,000 and 75% are below about 4,500,000.
Nevertheless, page views are full of outliers: people who have changed the course of history or impacted their fields with a breath of fresh ideas. The people with the most views on Wikipedia in this dataset are Michael Jackson, Adolf Hitler and Justin Bieber. An unlikely trio.
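As a rough check of what the log10 transform does to page views, the sketch below uses only the five-number summary printed above (no other data is assumed). It compares how far the quartiles sit from the median before and after the transform; the variable names are my own.

```r
# Five-number summary of page_views, copied from the output above.
page_views_summary <- c(min = 1965, q1 = 628928, median = 1603951,
                        q3 = 4485693, max = 145250649)

# Same summary on the log10 scale.
log_summary <- log10(page_views_summary)

# Distance from the median to the lower and upper quartile, raw vs. log10.
raw_gap <- c(
  lower = unname(page_views_summary["median"] - page_views_summary["q1"]),
  upper = unname(page_views_summary["q3"] - page_views_summary["median"])
)
log_gap <- c(
  lower = unname(log_summary["median"] - log_summary["q1"]),
  upper = unname(log_summary["q3"] - log_summary["median"])
)

print(round(log_summary, 2))
print(raw_gap)
print(round(log_gap, 2))
```

On the raw scale the upper gap is roughly three times the lower one, while on the log10 scale the two gaps are nearly equal, which is why the transformed histogram looks much more symmetric.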
Average Views
Average views is obtained by dividing the total number of page views by the number of languages the article is available in.
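The definition can be verified against the first three rows shown in the structure output at the top of this report. This is a minimal sketch; rounding to the nearest integer is my assumption, chosen because it reproduces the average_views values printed there.

```r
# First three rows of the structure output (Aristotle, Plato, Jesus Christ).
figures <- data.frame(
  full_name         = c("Aristotle", "Plato", "Jesus Christ"),
  page_views        = c(56355172, 46812003, 60299092),
  article_languages = c(152, 142, 214)
)

# Average views = total page views / number of language editions.
# Rounding is an assumption made to match the integer column in the dataset.
figures$average_views <- round(figures$page_views / figures$article_languages)

print(figures)
```

The computed column matches the 370758, 329662 and 281771 listed for these three figures.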
Average page views behaves similarly to total page views: it is skewed to the right, and applying log10 makes its distribution look closer to normal. Nevertheless it doesn’t measure the same thing as total page views. Average page views tells us who is consistently more popular across several languages, but because it is a mean it is still susceptible to being inflated by large values. Perhaps a median would have been more useful for describing the distribution of page views across languages.
The biographies with the highest average page views are Kim Kardashian, Lil Wayne and Eminem. This could mean either that they have a large international audience or that a few specific countries drive their views up, so that they have a high page view count relative to the number of languages their biography is translated into.
Historical Popularity Index (HPI)
The HPI is calculated from several key indicators, such as the time passed since the historical figure lived, the number of times their biography has been visited, the number of views in different languages and other factors.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.879 20.432 23.027 22.308 24.589 31.994
We can see its distribution resembles a normal one, but it is a little skewed to the left: the mean (22.3) sits below the median (23.0) and the lower tail is longer. Most values are in the 20 - 24 range.
There are 11337 notable people in the dataset with 12 features (full_name, birth_year, sex, country, continent, occupation, industry, domain, article_languages, page_views, average_views, historical_popularity_index). Several of them are factors (continent, sex, industry and domain), but none of them are ordered.
Other notable details of the variables are:
- A huge disparity between the number of men and women.
- Most notable people come from Europe and worked in government institutions.
- The most common occupation is politician.
- The country with the most great people is the United States.
- Half of the people in the dataset have between 628928 and 4485693 views (the first and third quartiles).
The features in the dataset I’d like to explore are gender and the historical_popularity_index. I’d like to determine the relationship between these variables and the rest of the dataset. I’d also like to know whether there are variables that predict “historical relevance” and how good they are as predictors.
I think domain, birth year, continent and the number of languages the biography is translated into might give an insight into the kind of historical figure that becomes popular on Wikipedia.
I did not create any new variables; I only used existing ones.
Most of the numeric variables are skewed, and most of them have lots of extreme values. For some variables, like page_views and average_views, I used log10 to transform the data so that the distribution could be seen and handled more easily.
I also transformed birth_year from a factor variable to a numeric variable, which made the most sense for visualizing and analyzing it.
I also dropped variables such as longitude and latitude. The information they give is very specific and adds little beyond what continent or country already tell me. I also dropped the state variable because it is too specific: it only applies to the US.
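The two cleaning steps described above can be sketched on a toy data frame. The report does not show the code that was actually used, so this is only an illustration: the column names latitude, longitude and state are assumed from the description, and the coordinate values are placeholders.

```r
# Toy data frame standing in for the raw file (values are placeholders).
raw <- data.frame(
  birth_year = factor(c("-384", "1919", "2005", "-4")),
  latitude   = c(0, 0, 0, 0),
  longitude  = c(0, 0, 0, 0),
  state      = c(NA, NA, NA, NA)
)

# Pitfall: as.numeric() on a factor returns internal level codes, not years.
level_codes <- as.numeric(raw$birth_year)

# Going through character first recovers the actual birth years.
raw$birth_year <- as.numeric(as.character(raw$birth_year))

# Drop the columns that are too specific for this analysis.
dropped <- c("latitude", "longitude", "state")
db <- raw[, !(names(raw) %in% dropped), drop = FALSE]

print(level_codes)
print(db)
```

Converting through `as.character()` matters because a direct `as.numeric()` call would silently replace years such as -384 with small integers that only encode the factor's level order.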
There are two variables in the data I want to explore, Sex and HPI. During this exploration I will search for relationships between the other variables and those elements.
Sex:
Women tend to have more views than men, both in total and on average, but men have more outliers in the upper part of the boxplot. The only boxplot that behaves differently is the one for historical popularity: here men tend to have higher values than women, and men again have more outliers, except that in this case most of them are in the lower part of the boxplot.
Men and women have similar distributions of languages in their biographies.
Birth year
We see the same explosion after 1750 in both sexes, but women lag behind for 250 years before seeing truly significant growth. More notable women were born in recent times (1750 and later) than in the rest of human history. Nevertheless men have far more records than women, as we saw in our previous analysis.
We also see in both genders the behavior observed before: both rise until recent times, where they have a small dip.
Location
Female historical figures were born mostly in the United States and follow a distribution similar to the general one we explored before. In other words, most of them were born in European countries or the US. One relevant difference between the two sexes is that most of the figures with Unknown nationality are men, not women.
On the continent scale, Europe is barely holding on to first place in terms of notable women; North America is quickly catching up.
The rest of the world (Africa, South America and Oceania) has almost no notable women.
Profession
At first glance we can see that among the less populated occupations in the dataset there is a clear divide between genders. Occupations such as gymnast, pornographic actor, model or companion tend to be filled by women, with few notable men in these roles. Meanwhile men dominate in roles such as historian, inventor, explorer or composer, which are popular occupations with few notable women.
Most notable women work in the Arts, specifically as actresses, which is reflected in the distributions of the industry and domain fields.
Here we see a big departure from the distribution we saw in the Domain univariate analysis. The distribution of domains for women is remarkably different from the one for men and from the general one.
Bivariate Analysis: Gender Exploration
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
First of all, there were some differences in occupations between men and women. Most notable men worked in institutions while women worked in the Arts, specifically as actresses. The most popular profession for men was politician.
In terms of location, most notable women came from the United States, with several European countries close behind. Looking at the data from a continent perspective, Europe is the place where most notable women were born, followed by North America and Asia, with few figures coming from the rest of the world. The location distribution for men is similar to the general one.
Another big difference comes in terms of popularity: women tend to have more pageviews, in total and on average, than men, but men tend to have a higher HPI than women. Both have a similar distribution of languages.
Through this exploration we could see the most common country for notable women to come from and the field in which a woman would most commonly work. Nevertheless that is only the beginning. I would like to know whether there are relationships between views, country and occupation. For example:
- Is a Male French artist more popular than an American one? How about a Female one?
- Which is the most common occupation for notable women per country? How about by continent? And for men?
- What kind of occupation receives the most page views on Wikipedia?
HPI
Page Views & HPI
##
## Pearson's product-moment correlation
##
## data: db$hpi and db$page_views
## t = 10.046, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0756611 0.1121524
## sample estimates:
## cor
## 0.0939383
An initial analysis of the relationship between page views and HPI seems to point to a lack of relationship between the two. The scatterplot and the Pearson correlation coefficient (0.09) both point in that direction.
To explore the relationship between page views and HPI a bit more, I decided to use log10 to normalize the distribution of page views.
##
## Pearson's product-moment correlation
##
## data: db$hpi and log10(db$page_views)
## t = 5.9375, df = 11335, p-value = 2.978e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03731294 0.07401490
## sample estimates:
## cor
## 0.05568273
After applying log10 to the variable we see that there is still no clear relationship between HPI and page views.
I double-checked using the Pearson correlation coefficient, which is lower than it was before (0.06), almost 0, meaning “no relation”.
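It may look odd that the p-values in the tests above are tiny while the correlation is read as almost 0. With more than 11,000 observations, even a very weak correlation is statistically distinguishable from zero, so the p-value says little about the strength of the relationship. The sketch below uses only the numbers printed by the first test (hpi against page_views) and the standard relation between the t statistic and r; the variable names are my own.

```r
r  <- 0.0939383  # Pearson correlation reported for hpi vs page_views
df <- 11335      # degrees of freedom reported by the test (n - 2)

# t statistic of the Pearson correlation test, derived from r and df.
t_stat <- r * sqrt(df / (1 - r^2))

# Share of variance in one variable linearly accounted for by the other.
r_squared <- r^2

print(round(t_stat, 2))
print(round(r_squared, 4))
```

The recomputed t statistic comes out at about 10.05, in line with the reported value, while the squared correlation is under 1%, which is what supports reading this as "no relation" in practical terms.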
Average Views
##
## Pearson's product-moment correlation
##
## data: db$hpi and db$average_views
## t = -9.5395, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10747533 -0.07095238
## sample estimates:
## cor
## -0.08924385
I ran the same process on average views as I did on page_views. On the scatterplot we can see a lack of relationship between the variables, a hypothesis that is confirmed by the Pearson correlation coefficient, -0.09, which indicates a lack of relationship.
I ran the test again with log10, to see whether normalizing the distribution of the variable helps.
##
## Pearson's product-moment correlation
##
## data: db$hpi and log10(db$average_views)
## t = -7.1522, df = 11335, p-value = 9.069e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08532955 -0.04867882
## sample estimates:
## cor
## -0.06702679
Just as with page_views, using log10 moves the Pearson coefficient closer to 0.
Both average and total page views seem to have little relevance to the importance of a historical figure.
Birth Year
In order to explore the correlation between HPI and birth year I transformed the birth year variable to avoid negative numbers, using the following formula: antiquity = 2013 - birth_year.
We subtract from 2013 since the dataset was last updated in May 2013.
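The formula can be checked against the first five birth years shown in the structure output at the top of the report. This is a minimal sketch; `reference_year` is just a name I chose for the constant 2013.

```r
# First five birth years from the structure output (negative = BC).
birth_year <- c(-384, -427, -4, -469, -356)

# Year the dataset was last updated.
reference_year <- 2013

# Antiquity: number of years between birth and the reference year.
antiquity <- reference_year - birth_year

print(antiquity)
```

The result (2397, 2440, 2017, 2482, 2369) matches the antiquity column in that output, and every value is positive, which is what makes a later log10 transform of antiquity possible.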
Using the scatterplot we can see that recent great people tend to have the most variance in HPI and older historical figures tend to have higher HPI. On the other hand, recent notable people are the only ones in the dataset with extremely low HPI values.
##
## Pearson's product-moment correlation
##
## data: db$hpi and log10(db$antiquity)
## t = 109.33, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7073565 0.7252787
## sample estimates:
## cor
## 0.7164358
Using the Pearson correlation coefficient (0.72, computed against log10 of antiquity) we can see that antiquity and HPI are strongly related. Antiquity could be one of the variables used to predict HPI.
Average Languages
There seems to be a vague positive relationship between HPI and the number of languages, but the scatterplot is too dispersed to tell. I used Pearson to check the relationship between the two.
##
## Pearson's product-moment correlation
##
## data: db$hpi and db$article_languages
## t = 52.566, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4277944 0.4573965
## sample estimates:
## cor
## 0.4427161
Languages and HPI have a high correlation coefficient in comparison to page views or average views. Nevertheless 0.44 is still too low to consider it a strong relationship.
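To put the coefficients from this section side by side, the sketch below collects the correlations reported in the test outputs above and ranks them by absolute size, adding the squared correlation as a rough "share of variance" reading. Nothing is recomputed from the data; the numbers are copied from the outputs and the data frame layout is my own.

```r
# Pearson correlations with hpi, copied from the test outputs above.
reported <- data.frame(
  variable = c("page_views", "log10(page_views)", "average_views",
               "log10(average_views)", "log10(antiquity)",
               "article_languages"),
  r = c(0.0939383, 0.05568273, -0.08924385,
        -0.06702679, 0.7164358, 0.4427161)
)

# Squared correlation: share of variance linearly accounted for.
reported$r_squared <- round(reported$r^2, 3)

# Strongest relationships first, regardless of sign.
reported <- reported[order(-abs(reported$r)), ]

print(reported, row.names = FALSE)
```

Read this way, log10 of antiquity accounts for about half of the variance in HPI and the number of languages for about a fifth, while the view-based variables each account for less than 1%.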
Profession & Domain
Among professions there are winners and losers. In the previous graphic we can see occupations that never rise above the median HPI; the worst case is swimmers, who fall behind every other profession. On the other side, the profession with the highest median HPI is philosopher.
Other low performing occupations include: gymnast, skater, skier or tennis player.
Other high performing occupations are: pirate, public worker or explorer.
In the previous graphs we can see the picture painted in the occupation boxplots come to life.
Industries related to sports do poorly on the HPI. Meanwhile, philosophy, history and fine arts are the ones with the highest HPI.
I have to point out that even though Fine Arts is doing great, other fields of the arts are not as lucky: Music, Film and Theatre and Media Personality are some of the worst performing industries.
Finally it all comes together in domain: the Humanities (where philosophy is located) soar along with Institutions, while the Arts and Sports are the fields with the lowest HPI.
Location
Most countries fall under the median HPI line, with Swaziland having the lowest HPI. The country with the highest HPI shouldn’t come as a surprise, since its historical contributions are well known and it is regarded as the birthplace of Western civilization: Greece.
It is also important to point out the behavior of the countries with the most notable people: the US, the UK and France. The UK and the US have their respective medians below the global one and have lots of outliers on either side. Meanwhile, France behaves similarly to Germany and Italy: they have a median HPI higher than the global one but also a lot of outliers pulling them down.
Finally I have to point out that Unknown country has one of the highest HPIs.
In terms of continent we see the previous behavior reinforced. Most continents have low HPI levels; only Europe, Asia and Unknown have a median higher than the global median.
I would also like to point out that North America has most of its values below the mean.
Sex
We analyzed the relationship between sex and HPI before. Men tend to have higher HPI values than women.
## db$sex: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.879 18.328 21.238 20.889 23.512 30.037
## --------------------------------------------------------
## db$sex: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.25 20.90 23.20 22.52 24.70 31.99
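The per-group output above has the shape produced by base R's `by()` combined with `summary()`. The report does not show the exact call, so the following is an assumed reconstruction on a toy data frame whose six values are simply the quartiles reported above, not real rows from the dataset.

```r
# Toy data: the reported Q1, median and Q3 of hpi for each sex.
toy <- data.frame(
  sex = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  hpi = c(18.328, 21.238, 23.512, 20.90, 23.20, 24.70)
)

# Six-number summary of hpi within each level of sex.
print(by(toy$hpi, toy$sex, summary))

# A single statistic per group is often easier to compare.
group_medians <- tapply(toy$hpi, toy$sex, median)
print(group_medians)
```

With the real data frame, the same pattern would be applied to the hpi and sex columns directly.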
General
## [1] "antiquity" "domain" "continent"
## [4] "article_languages" "sex" "page_views"
## [7] "average_views" "hpi"
The main takeaway of the exploration of HPI was the lack of relationship between it and most variables. The exceptions are antiquity, which has a strong correlation with HPI, and the number of languages.
As for the categorical variables, or factors: in the next phase of the exploration I will only use the least granular format, domain for occupations and continent for countries.
The continent variable has a couple of surprises: Unknown has the highest median in the dataset, followed by Europe and Asia. Surprisingly, North America lags behind. Continents like South America, Africa and Oceania have the lowest medians overall.
In terms of profession we can see that most domains have a median that is similar to the global one, with the exception of Sports, which is way below, and Humanities and Institutions, which are above the global median.
In these categorical variables we saw that the HPI distribution is not equal between levels: some have higher or lower HPIs. This is helpful for our modelling ambitions.
Among the other features (not the main features of interest), average_views and page_views have a high correlation index.
There is also a curious relation between views, antiquity and HPI. Just as with HPI, recent notable figures tend to have more variance in views, but as a figure gets older they tend to receive fewer views. This is the opposite of what happens with HPI: as a figure gets older it becomes more historically relevant.
Sex:
I come to this part of the analysis with questions already in mind, like:
- Which profession domain receives more visits depending on gender?
- Which continent has more notable women?
- Which is the most common
Sex, Occupation and Popularity.
As we saw in our earlier analysis, women tend to have more views than men. Now we see where those views concentrate.
Women receive significantly more pageviews in the following occupations: Outlaws, Dance or Design. These are fields where women are more popular than men in most cases. Women receive more pageviews than their male counterparts in almost every industry, but men tend to have more outliers above the 75% quantile. This means that women as a group tend to be more popular, but the notable person with the most page views in a specific industry is likely to be male.
We see similar behavior in the domain variable. Women tend to have more page views than men in most domains, except in Sports and Institutions, but we see the same behavior with outliers that we saw before: although men as a group tend to be less popular than women as a group, the most viewed notable figures tend to be men. The exception is the Arts and Exploration domains, where women tend to have more views and there are few or no male outliers.
Sex, Continent and Page Views
Here we see the same behavior that we saw in the location variables: women tend to have a higher median and 75% quantile, but men have more outliers, which ends up meaning that the most viewed people on each continent are likely men. Only North America and Oceania behave differently: there women have a higher median and 75% quantile with few or no male outliers.
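The pattern described here, where one group has the higher median and 75% quantile while the other holds the single largest value, can be illustrated with a small invented example. The page view numbers below are made up purely for illustration and do not come from the dataset.

```r
# Invented page view counts: one extreme outlier in the Male group.
toy <- data.frame(
  sex = c(rep("Female", 5), rep("Male", 5)),
  page_views = c(2e6, 3e6, 4e6, 5e6, 6e6,
                 1e6, 2e6, 3e6, 4e6, 9e7)
)

# Median, 75% quantile and maximum of page views for each sex.
stats <- sapply(split(toy$page_views, toy$sex), function(v) {
  c(median = median(v), q3 = unname(quantile(v, 0.75)), max = max(v))
})

print(stats)
```

In this toy example the Female column has the higher median and 75% quantile, yet the largest value sits in the Male column, which is the same shape the boxplots show.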
Sex, Location, Views and Occupation
Let’s start by dissecting each domain by the number of page views it gets on each continent.
Arts:
The first thing that pops up is the lack of women artists whose birth location is unknown. Second, notice that Europe shows the behavior observed before: women tend to receive more views but men have far more outliers with high page view counts. Other than that, we see that in Asia, Africa and North America women tend to have slightly more views. Artists from Oceania and South America tend to have more pageviews if they are male.
In other words, the most popular male artists are most likely to come from Europe and the female artists with the most pageviews from North America.
Business & Law:
Women drew the short straw in this domain. Most continents don’t have notable businesswomen, and in the ones that do (North America, Asia and Europe) they receive far fewer page views than their male counterparts.
Exploration:
The behavior is very similar to Business & Law, with the addition of African explorers.
Humanities:
Perhaps the most eye-catching fact of this section is the high number of views that South American women tend to have. Other than that, we see behavior similar to other fields of knowledge: women tend to have slightly more views, while men have more outliers with higher values.
Institutions: